Finding Approximate Matches in Large Lexicons

Authors

  • Justin Zobel
  • Philip W. Dart
Abstract

Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.
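As a rough illustration of the kind of combination the abstract describes, the sketch below pairs an n-gram lexicon index (a coarse search that selects candidates sharing n-grams with the query) with an edit-distance string measure (a fine search that ranks those candidates). This is not the authors' implementation. The function names, the boundary marker `#`, the choice of bigrams, the candidate cutoff, and the toy lexicon are all assumptions made for the example.

```python
from collections import defaultdict


def ngrams(word, n=2):
    """Return the set of n-grams of a word, padded with boundary markers (assumed '#')."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(lexicon, n=2):
    """Map each n-gram to the set of lexicon words that contain it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def edit_distance(a, b):
    """Levenshtein distance, computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def approximate_matches(query, index, n=2, candidates=10, k=3):
    """Coarse search by shared n-gram count, then fine ranking by edit distance."""
    counts = defaultdict(int)
    for gram in ngrams(query, n):
        for word in index.get(gram, ()):
            counts[word] += 1
    # Keep only the words sharing the most n-grams with the query.
    coarse = sorted(counts, key=lambda w: (-counts[w], w))[:candidates]
    # Re-rank the survivors with the more expensive string measure.
    return sorted(coarse, key=lambda w: (edit_distance(query.lower(), w.lower()), w))[:k]


# Hypothetical toy lexicon, for illustration only.
lexicon = ["zobel", "sobel", "nobel", "dart", "darts", "part", "lexicon"]
index = build_index(lexicon)
print(approximate_matches("zobell", index))
```

The point of the two-stage design is that the index keeps the number of expensive string comparisons small: only words that share at least one n-gram with the query are ever passed to `edit_distance`.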

Related resources

Generating Semantic Orientation Lexicon using Large Data and Thesaurus

We propose a novel method to construct semantic orientation lexicons using large data and a thesaurus. To deal with large data, we use Count-Min sketch to store the approximate counts of all word pairs in a bounded space of 8GB. We use a thesaurus (like Roget) to constrain near-synonymous words to have the same polarity. This framework can easily scale to any language with a thesaurus and a unz...

Computing Semantic Similarity between Skill Statements for Approximate Matching

This paper explores the problem of computing text similarity between verb phrases describing skilled human behavior for the purpose of finding approximate matches. Four parsers are evaluated on a large corpus of skill statements extracted from an enterprise-wide expertise taxonomy. A similarity measure utilizing common semantic role features extracted from parse trees was found superior to an i...

Approximate Name Matching

Looking up a person’s name in a list is a common operation in information systems. The list can be a customer or employee database, a phone directory, a passenger list, etc. Finding an exact match to a certain name is easy. Personal names, however, are easily misspelled, especially in an international setting. Hence, there is a need for a solution which can find approximate matches to a name in...

AN ALGORITHM FOR FINDING THE EIGENPAIRS OF A SYMMETRIC MATRIX

The purpose of this paper is to show that ideas and techniques of the homotopy continuation method can be used to find the complete set of eigenpairs of a symmetric matrix. The homotopy defined by Chow, Mallet-Paret and York [1] may be used to solve this problem with 2^n - n curves diverging to infinity, which for large n causes great inefficiency. M. Chu [2] introduced a homotopy equation...

REAFUM: Representative Approximate Frequent Subgraph Mining

Noisy graph data and pattern variations are two thorny problems faced by mining frequent subgraphs. Traditional exact-matching based methods, however, only generate patterns that have enough perfect matches in the graph database. As a result, a pattern may either remain undetected or be reported as multiple (almost identical) patterns if it manifests slightly different instances in different gr...

Journal:
  • Softw., Pract. Exper.

Volume 25

Publication date: 1995